Doubled buckets. Part 0 (rebalancer batching + ref inconsistencies during master switch)#649

Open
Serpentian wants to merge 14 commits into tarantool:master from Serpentian:doubled-buckets-part-0

Conversation


@Serpentian Serpentian commented May 13, 2026

Closes #573
Closes #351
Closes #528
Closes #526

Serpentian and others added 6 commits April 13, 2026 15:50
Luatest 1.4.2 broke compatibility with Tarantool 1.10. See
tarantool/luatest#453 for details. Let's use 1.4.1 for now.

NO_DOC=ci
NO_TEST=ci
Before this patch we woke up the GC service instead of the recovery
service in `bucket_recovery_pause`. That could lead to longer test
execution times. Fix it by changing `garbage_collector_wakeup` to
`recovery_wakeup`.

NO_TEST=refactoring
NO_DOC=refactoring
Before this patch the recovery service called the functions that
determine whether a bucket can be recovered one by one in
`recovery_step_by_type`. Moreover, they were called from
`recovery_service_f` with a lot of code duplication. In the following
patches the recovery service logic will become more complex. In order
to preserve code readability it was decided to:

  1) Use a for-loop over the recoverability checks instead of checking
     buckets' recoverability one by one.

  2) Join the logic of saving recovered bucket ids and updating the
     `_bucket` space into one separate function - `bucket_recover`.

The functions `bucket_recover` and `bucket_status_has_destination` will
be used in future commits.

Needed for tarantool#214
Needed for tarantool#573

NO_TEST=refactoring
NO_DOC=refactoring

Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
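The loop-based check described in the commit can be sketched as follows (the names `recoverability_checks` and `recovery_step` are illustrative, not vshard's real internals):

```lua
-- A sketch of the refactoring described above: recoverability checks
-- live in a table keyed by bucket status, and a single loop both
-- selects the recoverable buckets and collects their ids, roughly
-- what the real `bucket_recover` helper would do per bucket.
local recoverability_checks = {
    SENDING = function(b) return not b.has_destination end,
    RECEIVING = function(b) return not b.has_source end,
}

local function recovery_step(buckets)
    local recovered_ids = {}
    for _, bucket in ipairs(buckets) do
        local check = recoverability_checks[bucket.status]
        if check ~= nil and check(bucket) then
            -- In the real code the _bucket space would be updated
            -- here as well, in the same single place.
            table.insert(recovered_ids, bucket.id)
        end
    end
    return recovered_ids
end
```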
This commit adds a new READONLY bucket status. Old RW refs, created
before the bucket became READONLY, are still allowed on top of it, but
creating new ones is prohibited. It's the state in which we wait for
RW requests to end before sending the bucket. Later we'll introduce
waiting on replicas too. For details, please check out the following
RFC: tarantool#620.

The READONLY state replaces the manual rw_lock-ing of the buckets during
bucket sending, so this code can be dropped. The code for
`bucket_refrw_touch` can also be dropped: before this commit a ref was
always created, because there was no other way to set `rw_lock` on a
bucket other than a ref. Now we have a "persisted rw_lock" - the
READONLY bucket state - and we don't have to create the ref forcefully
anymore. However, we still must check that the bucket is ok while
sending it, so during the preparation phase of bucket send the touch
was replaced with `bucket_check_state()`.

The test from bucket_ref is removed, since it tested that the rebalancer
continues to work when the storage has a READONLY bucket, which is no
longer true. The rebalancer must wait for 0 SENDING, RECEIVING and
READONLY buckets in the cluster, and only after that can it try
rebalancing one more time.

Part of tarantool#573

NO_DOC=internal
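The READONLY semantics described above can be modeled with a few lines of Lua (a simplified illustration, not the actual vshard implementation):

```lua
-- Illustrative model of the READONLY status: refs taken before the
-- switch keep working, new RW refs are rejected, and the bucket is
-- ready to be sent once the old refs drain to zero.
local function bucket_ref_rw(bucket)
    if bucket.status == 'READONLY' then
        return nil, 'bucket is READONLY, new RW refs are prohibited'
    end
    bucket.rw_refs = bucket.rw_refs + 1
    return true
end

local function bucket_unref_rw(bucket)
    assert(bucket.rw_refs > 0)
    bucket.rw_refs = bucket.rw_refs - 1
end

local function bucket_is_ready_to_send(bucket)
    return bucket.status == 'READONLY' and bucket.rw_refs == 0
end
```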
The check for master is duplicated in the `bucket_send()` function: it
was done explicitly at the beginning of the function and then one more
time in `bucket_check_state()` (previously in `bucket_refrw_touch()`).

Let's drop the first check. It is already tested in
`storage_1_1_test.test_master_exclusive_api`.

NO_DOC=internal
NO_TEST=tested
This commit adds new functions, which are meant to be used instead of
bucket_transfer_start/end() when sending a bucket.

This is needed for tarantool#573, where the rebalancer will start to
check the refs on replicas; proper bucket batching is needed for that,
since it would be way too costly to do the check before every bucket
send.

For that to work we need to mark the bucket as `transferring` and block
new RW refs (`READONLY`) far earlier than `bucket_send` happens. The
first is required to simplify the process of picking buckets and helps
not to pick the same one twice. The second is essential, since we don't
want new refs to appear once we have picked the bucket for sending.

Needed for tarantool#573

NO_DOC=refactoring
NO_TEST=refactoring
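The early-marking idea can be sketched like this (`bucket_pick_for_send` is a hypothetical helper, not vshard's API):

```lua
-- A bucket picked for transfer is flagged immediately, so another
-- worker cannot pick it again, and it goes READONLY right away so
-- no new RW refs can appear between picking and sending.
local function bucket_pick_for_send(buckets)
    for _, bucket in ipairs(buckets) do
        if bucket.status == 'ACTIVE' and not bucket.transferring then
            bucket.transferring = true
            bucket.status = 'READONLY'
            return bucket
        end
    end
    return nil -- nothing left to pick
end
```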
The route dispenser is a simple container for routes; let's move it out
of the storage file into a separate one in the scope of tarantool#263.
While we're here, let's also rewrite its test to luatest in the scope
of tarantool#371.

Part of tarantool#263
Part of tarantool#371

NO_DOC=refactoring
NO_TEST=refactoring
@Serpentian Serpentian force-pushed the doubled-buckets-part-0 branch 3 times, most recently from d4e795f to 39b7141 Compare May 13, 2026 19:25
@Serpentian
Copy link
Copy Markdown
Collaborator Author

Serpentian commented May 13, 2026

  • Fixed the rocks bundle not being built properly due to route_dispenser.lua missing from the CMake file. Found in integration testing.
  • Fixed flaky auto_master_2_2_2, which was found in the catch_flaky workflow.

Comment thread test/rebalancer/receiving_bucket.result Outdated
This commit builds the ground for batching buckets before sending in
the rebalancer workers, which is needed for tarantool#333,
tarantool#573, tarantool#576. From now on a worker picks N buckets,
prepares them in a batch, and only after that do all of the workers
start sending these prepared buckets. The maximum number of buckets
unavailable for writing during rebalancing remains
`rebalancer_max_sending`.

The test checks that batching works properly when an error happens
during the preparation stage. An error during the sending stage is
already covered by the rebalancer/errinj test. There's also a test
which checks that in case of a slow wakeup the worker continues its
work properly.

Needed for tarantool#333
Needed for tarantool#573
Needed for tarantool#576

NO_DOC=refactoring
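The prepare-then-send worker loop can be sketched as follows (illustrative names; `prepare` and `send` stand in for the real preparation and sending steps):

```lua
-- The worker prepares up to `max_sending` buckets from its route
-- first, and only then sends them, so at most `max_sending` buckets
-- are unavailable for writing at any moment.
local function worker_send_batch(route, max_sending, prepare, send)
    local prepared = {}
    for i = 1, math.min(max_sending, #route) do
        if not prepare(route[i]) then
            -- A preparation error does not abort the batch: the
            -- already prepared buckets are still sent.
            break
        end
        table.insert(prepared, route[i])
    end
    for _, bucket_id in ipairs(prepared) do
        send(bucket_id)
    end
    return #prepared
end
```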
Before this commit the rebalancer picked the first encountered bucket
and waited for the number of RW refs on it to become 0. This doesn't
work reliably when the cluster has very long RW requests, which can
block bucket rebalancing for hours: the instance picks the same buckets
and times out while waiting for no refs.

In order to fix that, let's prefer buckets without refs. If there are
not enough such buckets, take the remainder from the first available
buckets, as it worked before.

Closes tarantool#351

NO_DOC=bugfix
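The selection policy amounts to a two-pass pick (a hypothetical helper, not vshard code): first collect ref-free buckets, then top up from the first refed ones.

```lua
-- Prefer buckets without RW refs; only if there are not enough of
-- them, take the remainder from the first available refed buckets,
-- as the rebalancer did before this fix.
local function pick_buckets(buckets, count)
    local picked = {}
    for _, b in ipairs(buckets) do
        if #picked == count then return picked end
        if b.rw_refs == 0 then table.insert(picked, b) end
    end
    for _, b in ipairs(buckets) do
        if #picked == count then return picked end
        if b.rw_refs > 0 then table.insert(picked, b) end
    end
    return picked
end
```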
@Serpentian Serpentian force-pushed the doubled-buckets-part-0 branch from 39b7141 to 7a62583 Compare May 15, 2026 10:58
@Serpentian Serpentian added the do not merge Not ready to be merged label May 15, 2026

@Gerold103 Gerold103 left a comment


Oh yes, these timeouts are beautiful ❤️.

Comment thread vshard/storage/init.lua Outdated
Comment on lines +2015 to +2019
opts = bucket_send_build_opts(table.deepcopy(opts), {
    ['timeout'] = consts.TIMEOUT_INFINITY,
    ['chunk_timeout'] = consts.DEFAULT_BUCKET_CHUNK_TIMEOUT,
    ['ref_timeout'] = consts.DEFAULT_BUCKET_REF_TIMEOUT,
})
Collaborator

I suggest storing the defaults outside of this function - the table is always the same, we don't need to create it on each call.

We also don't need deepcopy here; a shallow copy is enough, since we only change root fields of opts.

Collaborator Author

Yeah, we can do that for bucket_send. Other ones - no go, since they're not static and depend on M
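Both suggestions can be sketched together (hypothetical names and default values, shown only to illustrate the shape of the change):

```lua
-- Static defaults live in a module-level table created once, and
-- opts gets a shallow copy, which is enough because only root fields
-- of opts are changed afterwards.
local BUCKET_SEND_DEFAULTS = {
    timeout = 1 / 0, -- infinity; values here are illustrative
    chunk_timeout = 1,
    ref_timeout = 10,
}

local function table_shallow_copy(t)
    local res = {}
    for k, v in pairs(t) do
        res[k] = v
    end
    return res
end

local function apply_defaults(opts, defaults)
    opts = opts ~= nil and table_shallow_copy(opts) or {}
    for k, v in pairs(defaults) do
        if opts[k] == nil then
            opts[k] = v
        end
    end
    return opts
end
```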

Comment thread vshard/storage/init.lua Outdated
local opts = bucket_send_build_opts(nil, {
    ['timeout'] = M.rebalancer_bucket_send_timeout,
    ['ref_timeout'] = consts.DEFAULT_BUCKET_REF_TIMEOUT,
    ['pop_timeout'] = consts.DEFAULT_BUCKET_POP_TIMEOUT,
Collaborator

What you call here pop is rather sync. Pop is what we do, but sync is what we actually want on the high level, and what is being limited here. Pop sounds too low-level and specific to dispenser.

Collaborator Author

Ditto

Comment thread vshard/storage/init.lua Outdated
-- The opts are recreated for every bucket to allow reconfiguring
-- of the timeout when routes applier already works.
local send_opts = bucket_send_build_opts(nil, {
    ['timeout'] = M.rebalancer_bucket_send_timeout,
Collaborator

It might look simpler if you write the options as `timeout = ...` instead of `['timeout'] = ...`. The latter syntax is usually only needed when the name contains non-ID chars like `-` or `.`

Collaborator Author

Yeah, we can do that, done

Comment thread vshard/storage/init.lua Outdated
local type = 'positive number'
for name, default in pairs(defaults) do
    if opts[name] then
        lcfg.check_option(name, type, nil, opts[name])
Collaborator

Nice, good idea to reuse the cfg here! Perhaps let's save `lcfg.check_option` as a local `cfg_check_option = lcfg.check_option` at the beginning of the file, since it is used multiple times per bucket send.

Collaborator Author

Sure, done

Comment thread vshard/storage/init.lua Outdated
local function bucket_send_build_opts(opts, defaults)
    assert(defaults['timeout'])
    opts = opts or {}
    local type = 'positive number'
Collaborator

I would recommend considering allowing any number here, even negative. Otherwise it prevents writing linear code when the timeout was given to you from some other place, and you have to check it for > 0 every time and raise an error manually. For examples see util.fiber_cond_wait_xc() and util.future_wait_xc(): netbox futures and fiber conds require timeout >= 0, and this gets quite annoying, because sometimes we just want to let them raise a timeout error when they get it < 0.

Collaborator Author

Agree, and using negatives finally stabilizes the timeout test. Using even a tiny positive value can lead to success.

Comment thread vshard/storage/init.lua Outdated
local type = 'positive number'
for name, default in pairs(defaults) do
    if opts[name] then
        lcfg.check_option(name, type, nil, opts[name])
Collaborator

If the third param is always nil, then maybe make lcfg.check_option just not take it, and only have 3 params?

Collaborator Author

Let's leave it as general as possible. If we want to use that function for enums in the future, we'll need that 3rd option. It's in the middle to be consistent with the internal code of the cfg module.

Comment thread vshard/storage/init.lua Outdated
Comment on lines +2714 to +2715
['pop_timeout'] = consts.DEFAULT_BUCKET_POP_TIMEOUT,
['sync_timeout'] = consts.DEFAULT_BUCKET_SYNC_TIMEOUT,
Collaborator

So what I am proposing is that pop_timeout is removed. Only sync_timeout remains, because that is basically what pop does - it either takes a ready bucket or waits for a sync.

Collaborator Author

Okaaay, we can remove it from opts. But we still need it and I don't think using the sync_timeout in pop is a good idea. Now it's dynamic: M.rebalancer_worker_count * opts.ref_timeout + opts.sync_timeout

This commit reworks timeouts for the manual `bucket_send()` and for the
rebalancer routes applier worker. Before this commit we had one single
timeout (for manual sends - DEFAULT_BUCKET_SEND_TIMEOUT, for the
rebalancer - REBALANCER_CHUNK_TIMEOUT), which was used in all places
without any updating; consequently, it was very non-obvious how much
time a bucket send would take.

From now on `bucket_send` accepts three different timeouts:
 - `timeout` - the max number of seconds the whole send can take;
 - `ref_timeout` - the max number of seconds to wait for no refs;
 - `chunk_timeout` - the old `timeout` option: how many seconds sending
   one chunk of data can take.

The initial motivation for such an approach is the appearance of the
synchronization of the `_bucket` space before sending a bucket, which
will be added in the upcoming commits. It's very inconvenient to bump
the timeout in tests for them to time out in the proper phase.

The commit also introduces a new configuration option -
`rebalancer_bucket_send_timeout`. The timeout for sending one bucket
has always been infinity; there was no way to limit it, and this option
solves that problem.

The commit also drops `DEFAULT_BUCKET_RECV_TIMEOUT`, which was not used
in the code at all.

@TarantoolBot document
Title: Document rebalancer_bucket_send_timeout storage option
Since: VShard 0.1.41

The timeout in seconds for sending a single bucket inside the
rebalancer. Can be used to limit the time a bucket can be unavailable
for writes during bucket sending. Can also be used with a
`router.map_callrw` request to guarantee that the request will succeed
if the request's timeout is > `rebalancer_bucket_send_timeout`.

Default value is infinity.
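The way the split timeouts can compose is easy to sketch with plain deadline arithmetic (illustrative helpers, not vshard's internals):

```lua
-- The overall `timeout` is turned into a deadline, and each phase
-- gets the smaller of its own limit and the time left until that
-- deadline. A negative result simply means the phase times out
-- immediately, matching the review discussion about allowing
-- negative timeouts instead of checking `> 0` at every call site.
local function make_deadline(now, timeout)
    return now + timeout
end

local function phase_timeout(now, deadline, phase_limit)
    return math.min(deadline - now, phase_limit)
end
```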
Before this patch it was possible to break replication with long RW
requests during master switches: there was a master which processed
some RW requests, then the master was changed (a pretty common thing
when rebalancing is in progress due to high load on storages), then
the new master started sending a bucket, but a replica still had an RW
ref on it. This breaks replication and requires manual intervention
from the user (replication becomes stopped and must be manually
restarted).

In order to fix that we require all replicas to be running when
rebalancing happens. Before a bucket becomes SENDING it becomes
READONLY, and the master waits for RW requests on the replicas to end.
In the READONLY state it's prohibited to create new RW requests for
that bucket. For details see the tarantool#620 RFC.

The commit is also needed for fixing the doubled buckets during master
switch, tarantool#576. By syncing with all replicas we make sure that
all of them have the bucket at least in the READONLY state. So, after
a master switch happens, the router won't be able to send RW requests
to it. Together with the service which synchronizes the `_bucket` space
before any recovery/gc/rebalancing happens, it will solve
tarantool#576.

Note that from now on in some tests we should use `chunk_timeout`
instead of `timeout`, since otherwise we can time out while waiting
for synchronization rather than while sending data.

Additionally, this commit introduces `sync_timeout` for `bucket_send()`.

Closes tarantool#573
Part of tarantool#576

NO_DOC=bugfix
The test is pretty flaky and fails with the error "Duplicate key exists
in unique index 'pk' in space '_bucket'" when trying to force-create
buckets.

Let's properly wait for all buckets to be garbage collected, not just
one, as it was before.

NO_DOC=test
Sometimes the test failed with a `Not found` error when trying to send
a bucket from `replica_2_a` right after its start. The problem is that
the node should first get that bucket from `replica_2_b` via
replication.

It'll be fixed automatically with the introduction of the service which
waits for `_bucket` synchronization, but we're not there yet, so let's
fix it manually.

Closes tarantool#528

NO_DOC=test
Sometimes the test case `test_basic_storage_service_info` failed with a
`routes_applier_service is still nil` error. It sends a bucket and then
waits for the rebalancer to send that bucket back, and sometimes the
rebalancer cannot start and waits, since recovery is not finished yet.

Let's wait for recovery and gc to finish their work before enabling the
rebalancer, to speed up the creation of the apply routes workers, which
are checked in that test case.

Closes tarantool#526

NO_DOC=test
@Serpentian Serpentian force-pushed the doubled-buckets-part-0 branch from 7a62583 to 6d1fc8d Compare May 15, 2026 15:44

Labels

do not merge (Not ready to be merged), full-ci

3 participants